There are four types of errors generated by the HSM:
· Fatal errors.
· Non-recoverable errors.
· Recoverable errors.
· Programming errors.
Fatal errors indicate a hardware fault in the equipment. Such an error should be logged and reported for user action to be taken (e.g., report to supervisor). Fatal errors are normally reported on the console and are not seen by the host application. The host application usually times out if a fatal error occurs.
Non-recoverable errors cannot be rectified by the program and need user intervention (e.g., with the HSM set into the Authorised state). Such errors should also be logged and reported for user action to be taken (e.g., report to supervisor). This type of error does not mean that the HSM cannot action other types of commands.
Recoverable errors may be the result of data corruption or indicate that the HSM cannot process a command because some other action is required first. The application should attempt to recover by re-issuing the command, attempting to clear the corruption or by implementing the missing action (e.g., the HSM reports that the print format definition is not loaded, so the program should load it and re-issue the failed command).
Programming errors are normally found during testing, but if they occur at other times, they are probably non-recoverable.
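The handling described above can be sketched roughly as follows. This is a minimal illustration, not the HSM's host interface: the `ErrorCategory` enum, the `handle_error` function and its `recover`, `reissue` and `report` callbacks are all hypothetical names, and how a real response code maps to a category depends on the command set in use.

```python
from enum import Enum, auto


class ErrorCategory(Enum):
    """The four error types named in the text (hypothetical enum)."""
    FATAL = auto()
    NON_RECOVERABLE = auto()
    RECOVERABLE = auto()
    PROGRAMMING = auto()


def handle_error(category, recover, reissue, report):
    """Apply the handling suggested for each category.

    recover -- callable that performs the missing action or clears corruption
    reissue -- callable that sends the failed command again, returning its result
    report  -- callable that logs the error and alerts the user
    All three callables are assumptions supplied by the application.
    """
    if category is ErrorCategory.RECOVERABLE:
        # e.g. load the missing print format definition, then retry
        recover()
        return reissue()
    # Fatal, non-recoverable and (outside testing) programming errors
    # cannot be fixed by the program: log and report for user action.
    report(category)
    return None
```

A caller would pass, for example, a `recover` callback that loads the print format definition and a `reissue` callback that sends the original command again.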
Additionally, the application should monitor the HSM interface for timeouts.
In any of the above events, the application should try to continue processing by using another HSM to action the command. Continued failure may indicate a catastrophic failure of all HSMs (unlikely), a power failure or a program error.
The application should monitor usage of all HSMs and mark a unit as "out of service" if it has reported a fatal error or if it repeatedly reports non-recoverable errors.
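One way an application might combine failover with the "out of service" marking is sketched below. Everything here is an assumption made for illustration: the exception classes, the `HsmPool` class, the per-unit send callables and the threshold of three non-recoverable errors are not defined by the HSM. A timeout in this sketch only triggers failover; whether repeated timeouts should also retire a unit is left to the application.

```python
class HsmFatalError(Exception):
    """Hypothetical: raised by the application's transport for a fatal error."""


class HsmNonRecoverableError(Exception):
    """Hypothetical: raised for a non-recoverable error response."""


class HsmPool:
    def __init__(self, units, max_non_recoverable=3):
        # units maps a unit name to a callable that sends one command
        self.units = dict(units)
        self.max_non_recoverable = max_non_recoverable  # assumed threshold
        self.non_recoverable_counts = {name: 0 for name in self.units}
        self.out_of_service = set()

    def send(self, command):
        """Try each in-service unit in turn until one actions the command."""
        for name, send_fn in self.units.items():
            if name in self.out_of_service:
                continue
            try:
                return send_fn(command)
            except HsmFatalError:
                # A single fatal error takes the unit out of service.
                self.out_of_service.add(name)
            except HsmNonRecoverableError:
                # Repeated non-recoverable errors also retire the unit.
                self.non_recoverable_counts[name] += 1
                if self.non_recoverable_counts[name] >= self.max_non_recoverable:
                    self.out_of_service.add(name)
            except TimeoutError:
                # Interface timeout: fall through and try the next unit.
                pass
        # Every unit failed: possible power failure or program error.
        raise RuntimeError("no HSM could action the command")
```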
Hardware failures, software errors and alarm events are recorded in the Error Log. This has 100 slots, with fields for error code, sub-code, date, time and severity level. There are four severity levels: informative, recoverable, major and catastrophic (needing a reboot). Once an error has been recorded for a particular error code, any subsequent error with the same code updates the date and time for that code; thus each error type remains in the log until it is cleared.
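The update rule can be modelled with a small sketch. This is an illustrative model only, not the HSM's internal implementation: the class and field names are hypothetical, and the behaviour when all 100 slots are occupied is not described in the text, so the sketch simply declines to store a new code in that case.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    INFORMATIVE = 1
    RECOVERABLE = 2
    MAJOR = 3
    CATASTROPHIC = 4  # needs a reboot


@dataclass
class LogEntry:
    code: str
    sub_code: str
    timestamp: datetime  # combines the date and time fields
    severity: Severity


class ErrorLog:
    SLOTS = 100

    def __init__(self):
        self.entries = {}  # one entry per error code

    def record(self, code, sub_code, severity, when):
        """Record an error; return False if no slot is available."""
        entry = self.entries.get(code)
        if entry is not None:
            # Same code seen before: only the date and time are refreshed.
            entry.timestamp = when
            return True
        if len(self.entries) >= self.SLOTS:
            # Assumption: full-log behaviour is not specified in the text.
            return False
        self.entries[code] = LogEntry(code, sub_code, when, severity)
        return True

    def clear(self):
        """Empty the log, as clearing it from the console would."""
        self.entries.clear()
```

Because entries are keyed by error code, a fault that recurs many times occupies a single slot and shows only its most recent date and time.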
Error log maintenance is performed from the console using the command ERRLOG to retrieve the log and CLEARLOG to clear the log. Once the log has been read, the flashing Fault LED changes to a steady illumination. If the log is cleared, the Fault LED is extinguished.
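The Fault LED behaviour can be summarised as a small state machine. The sketch below is a model for illustration only; the `FaultLed` enum and `next_led_state` function are hypothetical, and the transition back to flashing when a new error is logged is an assumption inferred from the text rather than stated in it.

```python
from enum import Enum


class FaultLed(Enum):
    OFF = "off"
    FLASHING = "flashing"
    STEADY = "steady"


def next_led_state(state, event):
    """Return the LED state after a console command or a new error."""
    if event == "error":
        # Assumption: a newly logged error makes the LED flash.
        return FaultLed.FLASHING
    if event == "ERRLOG":
        # Reading the log turns a flashing LED steady.
        return FaultLed.STEADY if state is FaultLed.FLASHING else state
    if event == "CLEARLOG":
        # Clearing the log extinguishes the LED.
        return FaultLed.OFF
    raise ValueError(f"unknown event: {event!r}")
```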